Scraping examples

Some BeautifulSoup examples to help you scrape the HTML from web pages.

See the BeautifulSoup documentation for a technical reference.

1. Import libraries


In [ ]:
# Import all the things!
import urllib.request
from datetime import *
from lxml import html
from bs4 import BeautifulSoup

2. Define scraping functions


In [ ]:
# Scrape all HTML from webpage.
def scrapewebpage(url):
	# Open URL and get HTML.
	web = urllib.request.urlopen(url)

	# Make sure there weren't any errors opening the URL.
	if web.getcode() == 200:
		html = web.read()
		return html
	else:
		# Both values must be passed as one tuple to the % operator.
		print("Error %s reading %s" % (web.getcode(), url))

# Helper function that scrapes the webpage and turns it into soup.
def makesoup(url):
	html = scrapewebpage(url)
	return(BeautifulSoup(html, "lxml"))
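
One caveat with the status check above: for HTTP error responses such as 404 or 500, urllib.request.urlopen normally raises urllib.error.HTTPError instead of returning, so the else branch may rarely run. Below is a minimal sketch of a variant that catches those exceptions. The name scrapewebpage_safe and the 10-second default timeout are assumptions made for illustration and are not part of the original notebook.

```python
import urllib.error
import urllib.request


def scrapewebpage_safe(url, timeout=10):
    """Return the raw bytes of the page at url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as web:
            return web.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        print("Error %s reading %s" % (err.code, url))
    except urllib.error.URLError as err:
        # No usable response at all (bad host name, missing file, no network).
        print("Could not reach %s: %s" % (url, err.reason))
    return None
```

HTTPError is a subclass of URLError, so it has to be caught first. Callers should check for None before passing the result to BeautifulSoup.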

3. BeautifulSoup examples

A) Find an id

Use find(id="name") to find the first HTML tag that has an id attribute like this: <h2 id="mp-itn-h2"></h2>


In [ ]:
# Scrape Wikipedia main page.
wp_soup = makesoup("https://en.wikipedia.org/wiki/Main_Page")

In [ ]:
# Match the <h2> tag with the id mp-itn-h2.
h2 = wp_soup.find(id="mp-itn-h2")

h2

In [ ]:
# Only get the text inside <h2>.
h2.get_text()
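
To try find(id=...) without depending on a live page, here is a small sketch that parses an inline HTML string. The snippet in sample_html is made up for illustration, and it uses Python's built-in "html.parser" instead of "lxml" only to avoid an extra dependency. It also shows that find() returns None when nothing matches, which is worth checking before calling get_text().

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
sample_html = """
<div>
  <h2 id="first">In the news</h2>
  <h2 id="second">On this day</h2>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first tag whose id attribute matches.
found = soup.find(id="second")

# If no tag has that id, find() returns None instead of raising an error.
missing = soup.find(id="does-not-exist")

# Guard against None before asking for the text.
title = found.get_text() if found is not None else ""
print(title)
```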

B) Find a class

Use find("", "name") to find the first HTML tag that has a class attribute like this: <h2 class="name"></h2>


In [ ]:
# Scrape Wikipedia main page.
wp_soup = makesoup("https://en.wikipedia.org/wiki/Main_Page")

In [ ]:
# Find the first HTML tag that has class mw-headline.
headline = wp_soup.find("", "mw-headline")

headline

In [ ]:
# Only get the text inside the <span>.
headline.get_text()
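
The class can also be passed with the class_ keyword argument (with a trailing underscore, because class is a reserved word in Python). The sketch below uses a made-up HTML string and the built-in "html.parser", both chosen for illustration only. It shows the keyword form on its own and combined with a tag name.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
sample_html = """
<h2><span class="mw-headline">First headline</span></h2>
<h2><span class="mw-headline extra">Second headline</span></h2>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Match any tag that has the class mw-headline.
by_keyword = soup.find(class_="mw-headline")

# Match only <span> tags that have the class mw-headline.
by_tag_and_class = soup.find("span", class_="mw-headline")

print(by_keyword.get_text())
```

Restricting the search to a tag name makes the match less likely to pick up an unrelated element that happens to share the class.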

C) Find everything with a class

Use find_all("", "name") to find all HTML tags that have a class attribute like this: <h2 class="name"></h2>


In [ ]:
# Scrape Wikipedia main page.
wp_soup = makesoup("https://en.wikipedia.org/wiki/Main_Page")

In [ ]:
# Find all HTML tags that have the class mw-headline.
all_headlines = wp_soup.find_all("", "mw-headline")

all_headlines

In [ ]:
# Now we have a list that we can go through with a for loop.
for headline in all_headlines:
    headline = headline.get_text()
    print(headline)
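
If you want to keep the texts rather than print them, a list comprehension does the same loop in one line. This is a minimal sketch on a made-up HTML string, parsed with the built-in "html.parser". The variable names are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
sample_html = """
<span class="mw-headline"> First headline </span>
<span class="other">Not a headline</span>
<span class="mw-headline">Second headline</span>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# find_all() returns every matching tag, in document order.
all_headlines = soup.find_all(class_="mw-headline")

# strip=True removes leading and trailing whitespace from each text.
headline_texts = [tag.get_text(strip=True) for tag in all_headlines]
print(headline_texts)
```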

D) Find all <h3>

Use find_all("h3") to get all <h3> tags (or any other tag name).


In [ ]:
# Scrape Wikipedia main page.
wp_soup = makesoup("https://en.wikipedia.org/wiki/Main_Page")

In [ ]:
# Find all <h3> tags.
all_h3 = wp_soup.find_all("h3")

all_h3

In [ ]:
# Now we have a list that we can go through with a for loop.
for h3 in all_h3:
    h3 = h3.get_text()
    print(h3)
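
find_all() also accepts a list of tag names and a limit argument. The following sketch, on a made-up HTML string parsed with the built-in "html.parser", collects <h2> and <h3> tags together and then only the first <h3>.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
sample_html = """
<h2>Chapter</h2>
<h3>Section one</h3>
<p>Some text.</p>
<h3>Section two</h3>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# A list of names matches any of those tags, in document order.
headings = soup.find_all(["h2", "h3"])
heading_names = [tag.name for tag in headings]

# limit stops the search after that many matches.
first_h3_only = soup.find_all("h3", limit=1)

print(heading_names)
print(first_h3_only[0].get_text())
```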

E) Find all column values in a table

Use a for loop to go through the rows of a <table> and extract the column values.


In [ ]:
# Scrape a Wikipedia page with a table.
champ_soup = makesoup("https://en.wikipedia.org/wiki/European_Road_Championships")

In [ ]:
# Find <table class="wikitable">.
table = champ_soup.find("table", "wikitable")

table

In [ ]:
# Go through each row and take the text from 1st and 2nd column.
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    if len(cols) > 0:
        Year = cols[0].get_text()        # Get the text in the 1st column.
        Country = cols[1].get_text()     # Get the text in the 2nd column.

        print(Year + " " + Country)
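
The same loop can collect the values into a list instead of printing them, which is handy if you want to save or sort them later. The table below is made-up placeholder data, not the contents of the Wikipedia page, and the built-in "html.parser" is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical table with placeholder values.
sample_html = """
<table class="wikitable">
  <tr><th>Year</th><th>Country</th></tr>
  <tr><td> 2001 </td><td>Country A</td></tr>
  <tr><td>2002</td><td>Country B</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", "wikitable")

rows = []
for row in table.find_all("tr"):
    cols = row.find_all("td")
    # The header row only has <th> cells, so it is skipped here.
    if len(cols) >= 2:
        year = cols[0].get_text(strip=True)
        country = cols[1].get_text(strip=True)
        rows.append((year, country))

print(rows)
```

Checking for at least two cells, rather than more than zero, avoids an IndexError on rows that have only one <td>.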

F) Find a specific cell value in a table

Use this if you know the row and column of the <table> cell you want to extract.


In [ ]:
table = champ_soup.find("table", "wikitable")

# Get the cell value at row index 5, column index 1 (indexes start at 0, and the header row counts as a row).
cell = table.find_all('tr')[5].find_all('td')[1].get_text()

cell
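
Indexing directly into the rows and cells raises an IndexError if the table is shorter than expected. Here is a sketch of a small helper that returns None instead. The name get_cell is a hypothetical helper introduced for illustration, the table is made-up placeholder data, and the helper assumes non-negative indexes.

```python
from bs4 import BeautifulSoup


def get_cell(table, row_index, col_index):
    """Return the text of one <td> cell, or None if it does not exist."""
    rows = table.find_all("tr")
    if row_index >= len(rows):
        return None
    cols = rows[row_index].find_all("td")
    if col_index >= len(cols):
        return None
    return cols[col_index].get_text(strip=True)


# Hypothetical table with placeholder values.
sample_html = """
<table class="wikitable">
  <tr><th>Year</th><th>Country</th></tr>
  <tr><td>2001</td><td>Country A</td></tr>
  <tr><td>2002</td><td>Country B</td></tr>
</table>
"""
table = BeautifulSoup(sample_html, "html.parser").find("table", "wikitable")

print(get_cell(table, 1, 1))   # second row, second <td>
print(get_cell(table, 9, 0))   # row does not exist
```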

G) Find nested things

Use this if you want to find things that are nested inside each other.


In [ ]:
# Scrape Wikipedia main page.
wp_soup = makesoup("https://en.wikipedia.org/wiki/Main_Page")

In [ ]:
# Find <table id="mp-middle">.
middle_table = wp_soup.find("table", id="mp-middle")

# In the <table>, find <h2>.
h2 = middle_table.find("h2")

h2

In [ ]:
# Only get the text inside the <h2>.
h2.get_text()
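
Page markup changes over time, so the outer find() can return None, and calling find() on None raises an AttributeError. The sketch below guards against that and also shows the same nested lookup written as a CSS selector with select_one(). The HTML string is made up for illustration and is parsed with the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
sample_html = """
<h2>Outside the table</h2>
<table id="mp-middle">
  <tr><td><h2>Inside the table</h2></td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Step 1: find the outer element. Step 2: search only inside it.
middle_table = soup.find("table", id="mp-middle")
nested_h2 = middle_table.find("h2") if middle_table is not None else None

# The same lookup as a CSS selector: an <h2> anywhere inside table#mp-middle.
via_selector = soup.select_one("table#mp-middle h2")

if nested_h2 is not None:
    print(nested_h2.get_text())
```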